Estimating multiplicity of infection, allele frequencies, and prevalences accounting for incomplete data

Background Molecular surveillance of infectious diseases allows the monitoring of pathogens beyond the granularity of traditional epidemiological approaches and is well-established for some of the most relevant infectious diseases such as malaria. The presence of genetically distinct pathogenic variants within an infection, referred to as multiplicity of infection (MOI) or complexity of infection (COI) is common in malaria and similar infectious diseases. It is an important metric that scales with transmission intensities, potentially affects the clinical pathogenesis, and a confounding factor when monitoring the frequency and prevalence of pathogenic variants. Several statistical methods exist to estimate MOI and the frequency distribution of pathogen variants. However, a common problem is the quality of the underlying molecular data. If molecular assays fail not randomly, it is likely to underestimate MOI and the prevalence of pathogen variants. Methods and findings A statistical model is introduced, which explicitly addresses data quality, by assuming a probability by which a pathogen variant remains undetected in a molecular assay. This is different from the assumption of missing at random, for which a molecular assay either performs perfectly or fails completely. The method is applicable to a single molecular marker and allows to estimate allele-frequency spectra, the distribution of MOI, and the probability of variants to remain undetected (incomplete information). Based on the statistical model, expressions for the prevalence of pathogen variants are derived and differences between frequency and prevalence are discussed. The usual desirable asymptotic properties of the maximum-likelihood estimator (MLE) are established by rewriting the model into an exponential family. The MLE has promising finite sample properties in terms of bias and variance. The covariance matrix of the estimator is close to the Cramér-Rao lower bound (inverse Fisher information). Importantly, the estimator’s variance is larger than that of a similar method which disregards incomplete information, but its bias is smaller. Conclusions Although the model introduced here has convenient properties, in terms of the mean squared error it does not outperform a simple standard method that neglects missing information. Thus, the new method is recommendable only for data sets in which the molecular assays produced poor-quality results. This will be particularly true if the model is extended to accommodate information from multiple molecular markers at the same time, and incomplete information at one or more markers leads to a strong depletion of sample size.


Prevalence
Next follows the proof of Remark 1.
Proof of Remark 1.Without loss of generality, we derive the observable prevalence of lineage A n .As Qx x x is the probability mass function of x x x, we have The second sum on the right-hand side runs over all configurations with x n = 1, and hence, is the prevalence of lineage A n , which we denote by qn .Let e e e n = (0, . . ., 0, 1) be the configuration corresponding to observing only lineage A n in an infection.According to (5), the probability of configuration x x x = e e e n is Qe e en = 1 e λ − 1 ((1 − ε)(e λpn − 1)) Further, rearrangement of (9) gives By using (5), we realize that the sum on the right-hand side can be rewritten, so that the above equation becomes Because Q0 0 0 is given by (8) we further obtain 1 − qn = ε(e λpn − 1) + 1 qn − 1 e λ − 1 , which after some algebraic manipulations yields qn = ∑ The observable prevalence of lineage A n , qn , is the probability that A n is actually observed in a sample.Since the empty records do not provide any information concerning the lineages present the corresponding infections, it might be more November 23, 2023 2/30 convenient to not account for them in the definition of prevalence, i.e., to condition the observable prevalence on not observing the non-empty records, i.e., x x x ≠ 0 0 0. The conditional prevalence of lineage Since n was arbitrary, it can be replaced by any k = 1, . . ., n − 1.
Super-infections rarely occur if the MOI parameter λ is close to zero.Considering the limit case yields lim λ→0 The nominator and denominator of the right-hand side of (16) both equal zero for λ = 0. Application of l'Hospital's rule gives Hence, if the MOI parameter rate is close to zero, a lineage's prevalence is close to its frequency.

Log-likelihood function
Under the IDM, the log-likelihood function corresponding to the observed dataset X consisting of N independent observations x x x (1) , . . ., x x x (N ) is given by equation ( 13).Now, we simplify the log-likelihood function.Remember, we defined Õ = {0, 1} n and O = {0, 1} n /{0 0 0}, and n x x x as the number of samples in X with configuration x x x.In particular, n 0 0 0 was defined as the number of empty records in X .The log-likelihood function becomes The first and second terms on the right-hand side of (18) are derived respectively as and November 23, 2023 3/30 To simplify (19b), note that the number of samples in the dataset X for which lineage A k is observed was denoted by N k (14).Clearly, Consequently, we have where N + was defined as the number of non-empty records, which in mathematical notation is (cf.eq.16) Hence, (19b) can be rewritten as By noting that N = n 0 0 0 + N + , the log-likelihood function is derived as .

Inverse Fisher information
Because the frequency vector p p p is an element of the n − 1 dimensional simplex, it is convenient to introduce a Lagrange multiplier to maximize the likelihood function (21).
Then the resulting score function needs to be equated to zero.Unfortunately, no closed solution exists for the MLE.Hence, it needs to be derived numerically.A straightforward, but numerically not advisable, approach would be to use a Newton method.This would require all second-order partial derivatives of the (21).The derivation of these partial derivatives is presented here, not merely because of completeness, but because it is convenient to derive the inverse Fisher information from these second derivatives, which retain the Lagrange multiplier as a nuisance parameter (cf.[1]).
November 23, 2023 4/30 To simplify the notation, let For k = 1, . . ., n, the first partial derivatives of the Lagrange function (21) with respect to the model parameters are calculated to be and The second-order partial derivatives, i.e., the entries of the Hessian matrix of the November 23, 2023 5/30 Lagrange function (21) are derived as (ε(e λp l − 1) + 1) 2 ) (ε(e λp l − 1) + 1) p l e λp l ε(e λp l − 1) + 1 ) November 23, 2023 6/30 2 ) and for k, j = 1, . . ., n (k ≠ j), where the order of the derivatives in (22) can be interchanged.The Fisher information is derived by substituting (24) into (22) using Remark 1.After some algebraic manipulation, the entries of the Fisher information are l (e λp l − 1)(ε(e λp l − 1) + 1) November 23, 2023 and for k, j = 1, . . ., n (k ≠ j), where the entries are symmetric.Note that (i) the inverse Fisher information is of primary interest because it yields the asymptotic variance of the estimator and (ii) Lagrange-multiplier is kept as a nuisance parameter.Retaining the Lagrange-multiplier has the conceptional advantage that the structure of the Fisher information is simpler and hence it can be inverted more easily.A straightforward approach is block-wise inversion.Namely, the Fisher information has the following structure where and The Fisher information can be straightforwardly inverted, by applying two blockwise inversion formulae.First, , the matrix A 1 needs to be inverted by blockwise inversion.
Then the blockwise inversion formula that uses the inverse A −1 1 can be applied, which November 23, 2023 8/30 results only in the inversion of 2 × 2 matrices.Due to the complexity of the resulting formulae, this is omitted here.Once the Fisher information is inverted, the row and columns corresponding to the nuisance parameter γ can be disregarded.Let the resulting matrix be denoted by I −1 .The average MOI ψ might be of more interest than the MOI parameter λ.Following the steps in [1] with obvious modifications yields the inverse Fisher information for the parameter vector (ψ, ε, p p p), as where and the entries v i,j denote the entries of the inverse Fisher information.

Incomplete-data model (IDM) as a natural exponential family
First, we derive (31).We introduced f 0 (x x x) = 1 2 n (30) as the base density, which is uniform on the support of x x x.The probability of observing a non-empty record (7a) can be rewritten as Similarly, the probability of observing an empty record is rewritten as For k = 1, . . ., n, let November 23, 2023 9/30 and additionally let and Thus, ( 26) and ( 27) can be rewritten as and Remember, for an empty record and for a non-empty record . By using these facts we see that Therefore, (31) can be written jointly as for all x x x ∈ Õ.
November 23, 2023 10/30 To rewrite (31) into the natural form, we need to show that the function g g g defined by ∈ Ω be arbitrarily chosen.From ( 28) and (29), we have and By manipulating (34a) we obtain while substituting (34c) into (34b) gives Some algebraic manipulations yield Hence, λ is unique for any choice of β β β.Additionally, from (34c) one derives for all k = 1, . . ., n.Since n ∑ k=1 p k = 1 and λ is given by (35), the relation holds.It only remains to show that this equation has exactly one solution for ε.It can be rewritten as November 23, 2023 11/30 Note that the denominator of f (ε) is always positive for any for all k = 1, . . ., n.Consequently, we have Further, it can be easily checked that f ′′ (ε) < 0, which means that f (ε) is in addition strictly concave.Therefore, f (ε) = 0 has a unique solution in (0, min )).The equation (39) can be solved iteratively for ε by a 1-dimensional Newton method.In detail, for an initial value ), the iteration will converge to the desired solution.After, the solution is substituted into (36) the unique lineage frequencies can be derived.

Minimality of the natural exponential family
To show that the natural exponential family (33) is minimal, we prove that there is no affine dependency between the components of z z z, as well as between the components of β β β.This is equivalent to showing that the sets {1, z 0 (x x x), z 1 (x x x), . . ., z n (x x x)} and {1, β 0 (θ θ θ), β 1 (θ θ θ), . . ., β n (θ θ θ)} are linearly independent.To verify the former, we show that at least one of the generalized Wronskians of {1, z 0 (x x x), z 1 (x x x), . . ., z n (x x x)} is non-zero [2,3], namely from (34) and ( 26) we obtain In order to show that the set {1, β 0 (θ θ θ), β 1 (θ θ θ), . . ., β n (θ θ θ)} is linearly independent, we introduce the transformations and for k = 1, . . ., n.This transformation defines a 1-1 map T ∶ Θ → (0, 1) × (1, +∞) n , by where the inverse of T is given by By this transformation, we eliminate one of the redundant parameters, namely, one of the lineage frequencies since they vary on the (n − 1)-dimensional simplex.Now, it suffices to show that the Jacobian of β β β(η η η) has full rank.Some algebraic manipulations yield that (34) can be written in terms of the new parameters as November 23, 2023 13/30 and The Jacobian of β β β(η η η) is We need the following partial derivatives and where the last term is true for l ≠ k.
To determine the rank of J β β β we apply a series of elementary row operations to reduce J β β β to a matrix in row echelon form.Namely, we multiply the (k + 1)-th row with the function and add it to the first row for all k = 1, . . ., n.This yields where ).
Clearly, the matrix (53) is in echelon form and has full rank.

Interpretation of the MLE
Here, we present the proof of Result 1, which gives an intuitive interpretation of the MLE.
Proof of Result 1.The assertion can be proved without explicitly calculating the MLE.Implicitly, the MLE is given by (48) in terms of the parameters of the natural exponential family, β β β.Let Rearrangement of (7c) yields Note that, (48) yields and where by using (43).Moreover, (44) yields Therefore,(57b) can be rewritten as November 23, 2023 15/30 which by (34c), ( 9) and (10) yields which is the third assertion.Furthermore, by using (58) and (59) one can rewrite (57a) as which yields Thus, (56) is derived as proving the first statement.Finally, by using ( 9) and ( 55), (11) can be rewritten as By substituting (61) into the above and using (60) one obtains which proves the second statement.◻

Pathological cases
Here, we switch between the original parametrization θ θ θ = (λ, p p p, ε) and the natural parameters of the exponential family β β β.The following calculations need to be understood in a limiting sense.We distinguish between three cases: (A) n 0 0 0 = 0, (B) (A) Note, n 0 0 0 = 0 is equivalent to N = N + .In this case U U U ∉ int C(U), hence the MLE β lies on the boundary of the parameters space R n+1 .Equivalently, θ θ θ lies on the boundary of the parameter space Θ.The condition, N = N + implies β0 = −∞ according to (48a) or equivalently ε = 0. We distinguish four sub-cases.
Because βk = ∞, this yields In terms of the original parameters this translates to which has to be understood in a limit sense, i.e., the likelihood function reaches a supremum in any limit λ → ∞ such that p j → 0 but for all k.Also in this case U U U ∉ int C(U) and the estimate lies on the boundary of the admissible parameter space.Notice that βk = ∞ for one or more k is impossible, because (48b) would then yield N k = N , which would imply n 0 0 0 = 0. Hence, βk = −∞ for all k which also implies β0 = −∞ because by (48a) we have This translates into λ = 0 since not all p k can vanish.However, the MLE for ε is found in the limit λ → 0. By replacing β β β in (48a) with its equivalent in (28) in terms of the original parametrization, we derive that the estimate satisfies 1 − e −λ (68) in the limit λ → 0. The limit is derived by applying the l'Hospital's rule to the right -hand side of (68), i.e., November 23, 2023 17/30 Therefore, ε = n 0 0 0 N .Additionally, by using ( 67) and (48b), we derive which again by replacing β β β with its equivalent in (28) gives Taking the limit λ → 0 and applying l'Hospital's rule yields pk = N k N + .
We will distinguish between β0 = 0 and β0 > 0, for which the former results in a limit case in terms of the original parameters.Notably, β 0 = 0 is not at the boundary of the natural parameter space R n+1 , however, it maps into a boundary point of the original parameter space Θ.In other words, for β 0 = 0 and arbitrary β k ∈ R for all k, there exists no θ θ θ ∈ Θ such that g g g(θ θ θ) = β β β.We show that this case occurs if (1 + e β l ).We recognize (67) yields Substituting in (48b) gives which implies Hence, we derive If this equation is substituted into (48a), one obtains November 23, 2023 18/30 For β0 = 0 one obtains Furthermore, for β0 = 0, (48b) yields for any k = 1, . . ., n.In this case, βk = −∞ is impossible, as it would result in N k = 0.Moreover, βk = ∞ is also impossible since it would yield N k = N or equivalently n 0 0 0 = 0. Notice that β0 = 0 translates into λ = ∞.Formally, replacing β k in (78), with the expression in (34c) implies that must hold for all k.Because λ = ∞, this holds only if ε = 1 − N k N or pk = 0, and the attained and a supremum is found in the limit λ → ∞.In terms of original parameters, implies that the supremum is reached in any limit λ → ∞ such that Since λ > 0 and p j ≥ 0, the above relations imply ).In summary, the supremum is attained in any limit λ → ∞ such that subject to the constraint The case β0 = 0 holds hence under the condition ), in which case the supremum (80) is reached.
We prove that this yields a contradiction.We have This inequality in combination with (76) gives This can be rearranged as Since we assumed β 0 < 0, we can conclude which after rearrangement yields contradicting the assumption.Therefore, ) results in β0 > 0, for which we cannot find a solution in the original parameter space Θ.Hence, the natural parameter space needs to be restricted to β 0 ≤ 0. Since no maximum is attained in R − × R n , a boundary maximum must be attained at β 0 = 0, which yields exactly the limit case (80b).◻

The EM Algorithm
The EM algorithm finds the maximum-likelihood estimates of the model parameters θ θ θ by a two-step iterative procedure.Given an initial value θ θ θ (0) , the algorithm generates a sequence θ θ θ (t) until it converges to the MLE θ θ θ.The convergence is imposed by requiring that, e.g., the Euclidean distance, difference of consecutive values of the sequence, i.e., || 2 is smaller than some small but arbitrary threshold δ, e.g., δ = 10 −8 .As before, let X be a dataset of size N containing observation vectors x x x (1) , . . ., x x x (N ) , which can be regarded as observed information corresponding to the unobserved MOI configurations m m m (1) , . . ., m m m (N ) .For each j we have x x x (j)  = (x Let M denote the unobserved dataset corresponding to X .In each iteration, the current parameter choice is θ θ θ (t)  = (λ (t) , p p p (t) , ε (t) ).First, the expectation step is carried out followed by the maximization step.

Expectation step
In step t, first we calculate the conditional expectation of the Q-function which is the expectation of the log-likelihood function of the model parameters θ θ θ given the November 23, 2023 20/30 unobservable M (and hence the observable data X ) with respect to the conditional probability of M conditioned on the observable data X and the current parameter choice θ θ θ (t) .In mathematical terms, the Q-function is defined as Note that the parameter choice in step t in the Q-function, θ θ θ (t) , is fixed, while the parameters θ θ θ are considered variables.We can also write the Q-function as follows ) = E m m m (1) ,...,m m m (N ) | x x x (1) ,...,x x x (N ) ,θ θ θ (t) ( log P ( m m m (1) , . . ., m m m (N ) , x x x (1) , . . ., x x x (N ) | θ θ θ )).
Since the infections are independent and have the same probability mass function, the above leads to From marginalization, it is easily seen that Hence, the Q-function becomes Deriving the Q-function is lengthy.Let and x x x (j) ≠ 0 0 0. Additionally, let m k be the MOI of sample j, and y y y (j) ∶= sign ( m m m (j) ), i.e., y k ) for k = 1, . . ., n.Hence, y y y (j) indicates the actual absence/presence of lineages in the j-th infection.Clearly x x x (j)  ⪯ y y y (j) .We have Because y y y (j) fully determines x x x (j) , we have P ( x x x (j) | y y y (j) , m m m (j) , m (j) , θ θ θ ) = P ( x x x (j) | y y y (j) , θ θ θ ).
Thus, we can rewrite (88) as The first term on the right-hand side is the probability of observed data x x x (j) given that the true absence/presence of lineages are y y y (j) (6).The second term is the probability that m k of the infecting lineages within the infection is the lineage A k given that the MOI is m (j) .This is a multinomial distribution and is given by n n , (cf.[4]).The third term is the probability of MOI m which is a conditional Poisson distribution and is specified in (1).Therefore, (92) becomes Thus, rewriting (87) by using (93) gives where is independent of θ θ θ.
To simplify ) the conditional expectation needs to be calculated explicitly.For the following calculations, we drop the super scripts j and consider x x x to be an arbitrary non-empty record in X .Following the definition of conditional expectation, we derive November 23, 2023 22/30 and ). (95c) By rearranging (6) we obtain Let We can rewrite S 1 as November 23, 2023 23/30 As above (2), let A(x x x) = {k | x k = 0}.We can rewrite S 1 as where .
Separating the other sum and using the identity for a h = ε (t) (e λ (t) p (t) h − 1) gives .
Using the identity (98) in the last term and taking the derivative yields Finally, similar calculations yield , (for a proof of the latter see [4]).Hence, November 23, 2023 25/30 Therefore, (95) becomes and So far we derived the required terms conditioned on a non-empty record x x x.It is not difficult to see that for an empty record x x x = 0 0 0 we have = 0 0 0. Note that equation ( 86) is derived as x x x (j) ≠0 0 0 Q j ( θ θ θ | θ θ θ (t) ) + ∑ x x x (j) =0 0 0 where the first sum on the right-hand side runs over all non-empty records and the second sum runs over all empty records in X .By using k , and that N + and n 0 0 0 are respectively the number of non-empty and empty records, we derive ) =λ where 2 is a constant and independent of θ θ θ, and and

N
k > N and 0 < N k < N for all k.Because the likelihood function of the IDM for ε = 0 coincides with that of the OM, the results of[4] imply λ > 0 and pk > 0 for all k.

N( 1 + 3 ) l=1 ( 1 +
k = N but 0 < N k < N for all k, βk = ∞ is impossible because (48b) would imply N k = N .Hence, βk = −∞ for all k, and (48b) implies n ∏ l=1 e βl ) = 1 or e βk = 0 for all k, since e β0 = 0.This translates into λ = 0 since not all p k can vanish.From this e βk (1 + e βl ) e βl (1 + e βk ) = N k N l follows, which translates into pk pl = N k N l , or pk = N k Assume N k = N and N j > 0 for at least one j ≠ k.By equating N k = N in (48b), some algebraic manipulations yields n ∏ e βl ) = 1 + e βk .The latter implies βk = ∞, or λ = ∞ because ε = 0. (Note that βl = −∞ for all l ≠ k, is impossible because it would imply N l = 0, a contradiction.)Furthermore, if for lineage A j we have N j < N ,

subject to the constraint n ∑ l=1 pl = 1 .N
In particular, in the case that N k = N for two or more k, the supremum of the likelihood function is a continuum.(A.4)If N k ≠ 0 for only one k, obviously the likelihood function attains its maximum for any λ ≥ 0 and pk = 1.k = N + , but 0 < N k < N +

and S 3
− x k ))P ( m m m, x x x | θ θ θ(t) sign m m m=y y y | m m m |=m